Matching-Based Allocation Strategies for Improving Data Locality of Map Tasks in MapReduce

Authors

  • Olivier Beaumont
  • Thomas Lambert
  • Loris Marchal
  • Bastien Thomas
Abstract

MapReduce is a well-known framework for distributing data-processing computations on parallel clusters. In MapReduce, a large computation is broken into small tasks that run in parallel on multiple machines, and it scales easily to very large clusters of inexpensive commodity computers. Before the Map phase, the original dataset is first split into chunks, which are replicated (a constant number of times, usually 3) and distributed onto the computing nodes. During the Map phase, nodes request tasks and are first allocated tasks associated with local chunks (if any). Communications take place when requesting nodes no longer hold any local chunk. In this paper, we provide the first complete theoretical data locality analysis of the Map phase of MapReduce and, more generally, of bag-of-tasks applications that behave like MapReduce. We show that if tasks are homogeneous (in terms of processing time), once the chunks have been replicated randomly on resources with a replication factor larger than 2, it is possible to find a priority mechanism for tasks that achieves a quasi-perfect number of communications using a sophisticated matching algorithm. In the more realistic case of heterogeneous processing times, we show, using an actual trace of a MapReduce server, that this priority mechanism makes it possible to complete the Map phase with significantly fewer communications, even on realistic distributions of task durations.

Key-words: MapReduce, Analysis of Randomized Algorithms, Matchings, Resource Allocation and Scheduling, Balls-into-bins.

∗ Inria & University of Bordeaux, France
† CNRS, LIP, École Normale Supérieure de Lyon, INRIA, France
‡ ENS of Rennes, France
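The Map-phase allocation summarized in the abstract can be illustrated with a small simulation. The Python sketch below is not the authors' algorithm. It assumes unit-time (homogeneous) tasks, nodes that request one task per round, and a per-node capacity of ceil(tasks / nodes) for the matching. All function names are hypothetical. It compares a locality-first greedy allocation against a capacitated bipartite matching computed with augmenting paths, counting one communication for every task that runs on a node holding no replica of its chunk.

```python
import random

def place_chunks(num_tasks, num_nodes, replication, rng):
    """Replicate each chunk on `replication` distinct nodes chosen uniformly at random."""
    return [rng.sample(range(num_nodes), replication) for _ in range(num_tasks)]

def greedy_non_local(placement, num_nodes, rng):
    """Round-based greedy: each node takes a random local chunk if one is left,
    otherwise a random remote chunk, which costs one communication."""
    remaining = set(range(len(placement)))
    local = [set() for _ in range(num_nodes)]
    for task, nodes in enumerate(placement):
        for node in nodes:
            local[node].add(task)
    communications = 0
    while remaining:
        for node in range(num_nodes):
            if not remaining:
                break
            candidates = local[node] & remaining
            if candidates:
                task = rng.choice(sorted(candidates))
            else:
                task = rng.choice(sorted(remaining))
                communications += 1
            remaining.remove(task)
    return communications

def matched_non_local(placement, num_nodes):
    """Maximum assignment of tasks to nodes holding a replica, with at most
    ceil(tasks / nodes) tasks per node; unmatched tasks need a communication."""
    capacity = -(-len(placement) // num_nodes)  # ceiling division
    assigned = [[] for _ in range(num_nodes)]

    def try_assign(task, seen):
        for node in placement[task]:
            if node in seen:
                continue
            seen.add(node)
            if len(assigned[node]) < capacity:
                assigned[node].append(task)
                return True
            # Node is full: try to move one of its tasks to another replica holder.
            for i, other in enumerate(assigned[node]):
                if try_assign(other, seen):
                    assigned[node][i] = task
                    return True
        return False

    matched = sum(try_assign(task, set()) for task in range(len(placement)))
    return len(placement) - matched

if __name__ == "__main__":
    placement = place_chunks(200, 20, 3, random.Random(1))
    print("greedy communications:  ", greedy_non_local(placement, 20, random.Random(2)))
    print("matching communications:", matched_non_local(placement, 20))
```

Because the round-based greedy run also gives each node at most ceil(tasks / nodes) tasks, its local assignments form a feasible matching, so the matching count can never exceed the greedy count under these assumptions.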

Similar resources

Scheduling algorithm based on prefetching in MapReduce clusters

Due to cluster resource competition and task scheduling policy, some map tasks are assigned to nodes without input data, which causes significant data access delay. Data locality is becoming one of the most critical factors to affect performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, in-memory prefetching input data ...

Boosting MapReduce with Network-Aware Task Assignment

Running MapReduce in a shared cluster has become a recent trend to process large-scale data analytics applications while improving the cluster utilization. However, the network sharing among various applications can lead to constrained and heterogeneous network bandwidth available for MapReduce applications. This further increases the severity of network hotspots in racks, and makes existing ta...

Morpho: A decoupled MapReduce framework for elastic cloud computing

MapReduce as a service enjoys wide adoption in commercial clouds today [3,23]. But most cloud providers just deploy native Hadoop [24] systems on their cloud platforms to provide MapReduce services without any adaptation to these virtualized environments [6,25]. In cloud environments, the basic executing units of data processing are virtual machines. Each user’s virtual cluster needs to deploy HD...

ShuffleWatcher: Shuffle-aware Scheduling in Multi-tenant MapReduce Clusters

MapReduce clusters are usually multi-tenant (i.e., shared among multiple users and jobs) for improving cost and utilization. The performance of jobs in a multi-tenant MapReduce cluster is greatly impacted by the all-Map-to-all-Reduce communication, or Shuffle, which saturates the cluster’s hard-to-scale network bisection bandwidth. Previous schedulers optimize Map input locality but do not consid...

Improved Input Data Splitting in MapReduce

The performance of MapReduce depends greatly on its data splitting process, which happens before the map phase. This is usually done using naive methods that are far from optimal. In this paper, an Improved Input Splitting technology based on locality is explained, which aims at addressing the input data splitting problems that seriously affect job performance. Improved Input Splitting c...

Publication date: 2017